December 09, 2020


Background

Open Access Coronavirus Disease Epidemiological Data

Johns Hopkins University

The Center for Systems Science and Engineering (CSSE) at Johns Hopkins University provides a public, global COVID-19 GitHub repository (https://github.com/CSSEGISandData/COVID-19) with anonymous patient data aggregated from a number of sources.

We have built a centralised repository of individual-level information on patients with laboratory-confirmed COVID-19 (in China, confirmed by detection of virus nucleic acid at the City and Provincial Centers for Disease Control and Prevention), including their travel history, location (highest resolution available and corresponding latitude and longitude), symptoms, and reported onset dates, as well as confirmation dates and basic demographics. Information is collated from a variety of sources, including official reports from WHO, Ministries of Health, and Chinese local, provincial, and national health authorities. If additional data are available from reliable online reports, they are included. Data are available openly and are updated on a regular basis (around twice a day).

CSSE Data Sources (partial list):

The CSSE data are used for all global analyses in this document.

The New York Times

The New York Times also publishes COVID-19 case and death data for the United States by state and by county. The U.S. data used for this analysis are pulled directly from The New York Times COVID-19 GitHub repository (https://github.com/nytimes/covid-19-data).

The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.


Data Analysis

The COVID-19 data from both the Johns Hopkins and New York Times repositories are pulled and used to calculate the rate of new reported cases for each country and the rates of new reported cases and deaths for each U.S. state and county. These rates are used to generate a predictive regression model for each locale. A risk prediction (ρ) is generated from these models, and the countries, states, and counties with the highest predicted risk are compared in the charts in this document. In the U.S. case-death charts, a generalized additive model (GAM) smoothing function is fit to each data set to make it easier to visualize trends.
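As a rough illustration of the first steps described above, the sketch below derives daily new cases from a cumulative series and fits an ordinary least-squares line to the most recent days. It is written in Python for brevity, whereas the analysis itself is built in R. The function names, the sample numbers, the window length, and the use of a simple linear slope are all assumptions for illustration; this is not the regression model used to compute ρ.

```python
from statistics import mean

def daily_new(cumulative):
    """Convert cumulative counts to daily new counts (the first day is dropped)."""
    # Clamp at zero: downward revisions in the source data would otherwise
    # produce negative "new" counts.
    return [max(b - a, 0) for a, b in zip(cumulative, cumulative[1:])]

def trend_slope(values, window=14):
    """Least-squares slope (change per day) over the last `window` values."""
    recent = values[-window:]
    n = len(recent)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(recent)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, recent))
    return sxy / sxx

# Hypothetical cumulative case counts for one locale (not real data).
cumulative = [100, 110, 125, 145, 170, 200]
new_cases = daily_new(cumulative)          # [10, 15, 20, 25, 30]
slope = trend_slope(new_cases, window=5)   # 5.0 additional new cases per day
```

A positive slope indicates that new cases are accelerating. How such a trend is mapped onto a risk score is a separate modeling choice that this sketch does not attempt.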

The risk assessment methodology used in this analysis has not been fully validated and is affected by noise in the data. As reported in White House press briefings, some counties post the incremental changes from the weekend as a single update on Monday, so a cyclical weekly variation can be observed in the data. This limits the accuracy of the model predictions. To increase prediction robustness, the model has been tuned to use data over a multi-day period as a compromise between the speed of detection of relevant changes in risk predictions and prediction error caused by sensitivity to noise.
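One common way to dampen a weekly reporting cycle is a trailing seven-day mean, because every window then contains each weekday exactly once. The Python sketch below shows the idea on made-up numbers. The text above says only that a multi-day period is used, so the seven-day window, the function name, and the sample data are assumptions rather than the model's actual tuning.

```python
def trailing_mean(values, window=7):
    """Trailing mean over `window` days; the first window-1 days yield None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

# Hypothetical daily counts: a Monday catch-up spike and a weekend dip,
# repeated for two weeks (Monday through Sunday).
daily = [110, 50, 50, 50, 50, 20, 20] * 2
smoothed = trailing_mean(daily)
# From the seventh day onward every value is 50.0: the weekly cycle cancels out.
```

The trade-off is the one noted above: a longer window suppresses more noise but reacts more slowly to a genuine change in trend.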

The predictive analytics model is built with the open-source R programming language using the Tidyverse family of packages.




Summary Results

World

There are 191 countries represented in the Johns Hopkins University data set. The Gross Domestic Product (GDP) data shown above represent per capita GDP at purchasing power parity (PPP) in international (Geary-Khamis) dollars. These data are obtained from the Countries by GDP (PPP) per capita (Wikipedia) web page. Only countries with a risk prediction value above 25 are shown.




U.S.

There have been 15,248,865 total COVID-19 cases (220,225 new cases per day) and 286,443 deaths (2,597 new deaths per day) in the United States from January 21, 2020 to December 08, 2020.







Individual States

29 states currently have risk predictions above 25.






Counties

There are 3,220 U.S. counties represented in the New York Times data set.






Community Mobility Data

To assist the global COVID-19 pandemic response, Google has made available detailed mobility estimates, expressed relative to local baselines and derived from mobile phone and other data of the kind used by traffic services such as Google Maps and Waze. The data are provided by Google in the form of Community Mobility Reports.

As global communities respond to COVID-19, we’ve heard from public health officials that the same type of aggregated, anonymized insights we use in products such as Google Maps could be helpful as they make critical decisions to combat COVID-19.

These Community Mobility Reports aim to provide insights into what has changed in response to policies aimed at combating COVID-19. The reports chart movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.

The data used for the analysis below is current through December 06, 2020.




U.S.


Note: The dotted grey line on each of the mobility charts represents the date (March 13, 2020) on which the U.S. declared a National Emergency Concerning the Novel Coronavirus Disease (COVID-19) Outbreak.




Individual States







Face Mask Data

On July 28, 2020, the New York Times released estimates of face mask usage by county calculated from nationwide responses to the survey question “How often do you wear a mask in public when you expect to be within six feet of another person?” The data was collected from July 2 to July 14, 2020.

This data comes from a large number of interviews conducted online by the global data and survey firm Dynata at the request of The New York Times. The firm asked a question about mask use to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level. (Several states have imposed new mask requirements since the completion of these interviews.)

An aggregate score was computed from the New York Times data for each U.S. county using a weighted average. State aggregate scores were then calculated using the mean county scores for each state.
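The two aggregation steps can be sketched in Python as follows. The column names are intended to match the published county mask-use file (COUNTYFP plus the five response categories); the weights, the sample rows, and the function names are hypothetical, since the weighting actually used is not stated here. The first two digits of the county FIPS code are taken as the state code.

```python
from collections import defaultdict
from statistics import mean

# Assumed weights for each response category (illustrative only).
WEIGHTS = {
    "NEVER": 0.0,
    "RARELY": 0.25,
    "SOMETIMES": 0.5,
    "FREQUENTLY": 0.75,
    "ALWAYS": 1.0,
}

def county_score(row):
    """Weighted average of the response shares for one county."""
    return sum(weight * row[name] for name, weight in WEIGHTS.items())

def state_scores(rows):
    """Unweighted mean of county scores, grouped by 2-digit state FIPS prefix."""
    by_state = defaultdict(list)
    for row in rows:
        by_state[row["COUNTYFP"][:2]].append(county_score(row))
    return {state: mean(scores) for state, scores in by_state.items()}

# Hypothetical response shares (each row sums to 1.0); not real survey values.
rows = [
    {"COUNTYFP": "01001", "NEVER": 0.0, "RARELY": 0.0, "SOMETIMES": 0.0,
     "FREQUENTLY": 0.0, "ALWAYS": 1.0},
    {"COUNTYFP": "01003", "NEVER": 1.0, "RARELY": 0.0, "SOMETIMES": 0.0,
     "FREQUENTLY": 0.0, "ALWAYS": 0.0},
    {"COUNTYFP": "02013", "NEVER": 0.0, "RARELY": 0.0, "SOMETIMES": 0.5,
     "FREQUENTLY": 0.5, "ALWAYS": 0.0},
]
scores = state_scores(rows)
```

Because the state score is a plain mean of county scores, a sparsely populated county counts as much as a large one. Weighting by county population would be an alternative design, but it is not what the paragraph above describes.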

The chart below shows predicted risk based on analysis of the state case data compared with face mask usage for all states with moderate-to-high risk predictions (greater than 5) on July 30, 2020. The intent here is not to find any causal relationship. Some states have high mask usage because of high numbers of confirmed cases locally, and some states may have low local case numbers because of relatively high mask usage. The data may indicate, however, some level of additional risk for states with high predicted risk based on case data and low mask usage numbers (as of July 30, 2020). Oklahoma and Missouri stand out in this regard, although it is reasonable to expect that mask usage will increase in response to rising cases. (States with predicted risk greater than 25 and mask usage less than 50% are shown in yellow.)








Data Abnormalities

Analysis of the New York Times reported death data for the U.S. reveals a repeating weekly pattern in which the updates on Sunday and Monday are consistently lower than those reported on the other days of the week. As mentioned in the data analysis description in the Background section, the risk prediction algorithm has been configured to reduce the effect of this variation on the statistical model.
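A weekly pattern of this kind can be checked by averaging the daily counts by day of the week. The Python sketch below does this for a made-up two-week series; the function name and the numbers are hypothetical and are not taken from the New York Times data.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def mean_by_weekday(start, daily_counts):
    """Average the daily counts separately for each day of the week."""
    buckets = defaultdict(list)
    for offset, count in enumerate(daily_counts):
        day = start + timedelta(days=offset)
        buckets[WEEKDAYS[day.weekday()]].append(count)  # Monday is 0
    return {name: mean(values) for name, values in buckets.items()}

# Hypothetical daily death counts for two weeks starting on a Monday.
start = date(2020, 11, 30)
daily_deaths = [40, 100, 110, 105, 100, 90, 45,
                50, 110, 120, 115, 100, 95, 35]
averages = mean_by_weekday(start, daily_deaths)
# In this made-up series the Sunday and Monday averages are well below
# the averages for the other days.
```

With real data, a longer span than two weeks would be needed before reading much into the per-weekday averages.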